This dataset was created on behalf of the company Feedzai to support the evaluation of machine learning on realistic, current, large-scale data.
The data in the dataset are not directly associated with real people's bank accounts, since that would violate those people's privacy, but they are representative of a real population. Each instance of the dataset represents a synthetic bank-account-opening application, derived from real data using a CTGAN model trained on real bank-account-opening fraud data.
The dataset is composed of separate data tables, each with one million instances, of which we only use the first table, called "Base", because it is the only one into which no kind of bias was induced during the sampling process.
The goal of this problem is to predict, before letting a person open a bank account, the probability that doing so is fraudulent. Therefore, every variable of the dataset used by the model must be available at the moment of that call. To this end, a supervised classification algorithm will be used.
The steps performed are:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 5000)
import funciones_auxiliares as f_aux
ruta_folder = "../data/"
df_fraud_original = pd.read_csv(ruta_folder + "Base.csv")
# a first look at the dataset
df_fraud_original.head()
| fraud_bool | income | name_email_similarity | prev_address_months_count | current_address_months_count | customer_age | days_since_request | intended_balcon_amount | payment_type | zip_count_4w | velocity_6h | velocity_24h | velocity_4w | bank_branch_count_8w | date_of_birth_distinct_emails_4w | employment_status | credit_risk_score | email_is_free | housing_status | phone_home_valid | phone_mobile_valid | bank_months_count | has_other_cards | proposed_credit_limit | foreign_request | source | session_length_in_minutes | device_os | keep_alive_session | device_distinct_emails_8w | device_fraud_count | month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.9 | 0.166828 | -1 | 88 | 50 | 0.020925 | -1.331345 | AA | 769 | 10650.765523 | 3134.319630 | 3863.647740 | 1 | 6 | CA | 185 | 0 | BA | 1 | 0 | 24 | 0 | 500.0 | 0 | INTERNET | 3.888115 | windows | 0 | 1 | 0 | 7 |
| 1 | 1 | 0.9 | 0.296286 | -1 | 144 | 50 | 0.005418 | -0.816224 | AB | 366 | 534.047319 | 2670.918292 | 3124.298166 | 718 | 3 | CA | 259 | 1 | BA | 0 | 0 | 15 | 0 | 1500.0 | 0 | INTERNET | 31.798819 | windows | 0 | 1 | 0 | 7 |
| 2 | 1 | 0.9 | 0.044985 | -1 | 132 | 40 | 3.108549 | -0.755728 | AC | 870 | 4048.534263 | 2893.621498 | 3159.590679 | 1 | 14 | CB | 177 | 1 | BA | 0 | 1 | -1 | 0 | 200.0 | 0 | INTERNET | 4.728705 | other | 0 | 1 | 0 | 7 |
| 3 | 1 | 0.9 | 0.159511 | -1 | 22 | 50 | 0.019079 | -1.205124 | AB | 810 | 3457.064063 | 4054.908412 | 3022.261812 | 1921 | 6 | CA | 110 | 1 | BA | 0 | 1 | 31 | 1 | 200.0 | 0 | INTERNET | 2.047904 | linux | 0 | 1 | 0 | 7 |
| 4 | 1 | 0.9 | 0.596414 | -1 | 218 | 50 | 0.004441 | -0.773276 | AB | 890 | 5020.341679 | 2728.237159 | 3087.670952 | 1990 | 2 | CA | 295 | 1 | BA | 1 | 0 | 31 | 0 | 1500.0 | 0 | INTERNET | 3.775225 | macintosh | 1 | 1 | 0 | 7 |
# checking how many rows and columns the dataframe has
df_fraud_original.shape
(1000000, 32)
# confirming that the number of columns is 32
len(df_fraud_original.columns)
32
# six columns contain null values that were encoded as negative sentinels, so we
# convert them back to nulls (np.nan) to make them easier to analyse
df_fraud = df_fraud_original.copy()
for i in ["prev_address_months_count", "current_address_months_count", "bank_months_count",
          "session_length_in_minutes", "device_distinct_emails_8w"]:
    df_fraud[i] = np.where(df_fraud[i] == -1, np.nan, df_fraud[i])
df_fraud["intended_balcon_amount"] = np.where(df_fraud["intended_balcon_amount"] < 0,
                                              np.nan, df_fraud["intended_balcon_amount"])
df_fraud
| fraud_bool | income | name_email_similarity | prev_address_months_count | current_address_months_count | customer_age | days_since_request | intended_balcon_amount | payment_type | zip_count_4w | velocity_6h | velocity_24h | velocity_4w | bank_branch_count_8w | date_of_birth_distinct_emails_4w | employment_status | credit_risk_score | email_is_free | housing_status | phone_home_valid | phone_mobile_valid | bank_months_count | has_other_cards | proposed_credit_limit | foreign_request | source | session_length_in_minutes | device_os | keep_alive_session | device_distinct_emails_8w | device_fraud_count | month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.9 | 0.166828 | NaN | 88.0 | 50 | 0.020925 | NaN | AA | 769 | 10650.765523 | 3134.319630 | 3863.647740 | 1 | 6 | CA | 185 | 0 | BA | 1 | 0 | 24.0 | 0 | 500.0 | 0 | INTERNET | 3.888115 | windows | 0 | 1.0 | 0 | 7 |
| 1 | 1 | 0.9 | 0.296286 | NaN | 144.0 | 50 | 0.005418 | NaN | AB | 366 | 534.047319 | 2670.918292 | 3124.298166 | 718 | 3 | CA | 259 | 1 | BA | 0 | 0 | 15.0 | 0 | 1500.0 | 0 | INTERNET | 31.798819 | windows | 0 | 1.0 | 0 | 7 |
| 2 | 1 | 0.9 | 0.044985 | NaN | 132.0 | 40 | 3.108549 | NaN | AC | 870 | 4048.534263 | 2893.621498 | 3159.590679 | 1 | 14 | CB | 177 | 1 | BA | 0 | 1 | NaN | 0 | 200.0 | 0 | INTERNET | 4.728705 | other | 0 | 1.0 | 0 | 7 |
| 3 | 1 | 0.9 | 0.159511 | NaN | 22.0 | 50 | 0.019079 | NaN | AB | 810 | 3457.064063 | 4054.908412 | 3022.261812 | 1921 | 6 | CA | 110 | 1 | BA | 0 | 1 | 31.0 | 1 | 200.0 | 0 | INTERNET | 2.047904 | linux | 0 | 1.0 | 0 | 7 |
| 4 | 1 | 0.9 | 0.596414 | NaN | 218.0 | 50 | 0.004441 | NaN | AB | 890 | 5020.341679 | 2728.237159 | 3087.670952 | 1990 | 2 | CA | 295 | 1 | BA | 1 | 0 | 31.0 | 0 | 1500.0 | 0 | INTERNET | 3.775225 | macintosh | 1 | 1.0 | 0 | 7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 999995 | 0 | 0.6 | 0.192631 | NaN | 104.0 | 40 | 0.030592 | NaN | AB | 804 | 7905.711839 | 8341.468557 | 4972.635997 | 1 | 8 | CA | 75 | 1 | BC | 1 | 1 | 25.0 | 0 | 200.0 | 0 | INTERNET | 8.511502 | linux | 1 | 1.0 | 0 | 4 |
| 999996 | 0 | 0.8 | 0.322989 | 148.0 | 9.0 | 50 | 1.628119 | NaN | AC | 3306 | 5391.470463 | 4955.170808 | 5022.728108 | 0 | 2 | CC | 154 | 1 | BC | 1 | 1 | NaN | 0 | 200.0 | 0 | INTERNET | 8.967865 | windows | 0 | 1.0 | 0 | 4 |
| 999997 | 0 | 0.8 | 0.879403 | NaN | 30.0 | 20 | 0.018563 | 34.692760 | AA | 1522 | 8063.102636 | 5670.654316 | 4377.196321 | 2023 | 6 | CF | 64 | 0 | BC | 0 | 1 | 11.0 | 0 | 200.0 | 0 | INTERNET | 8.195531 | other | 0 | 1.0 | 0 | 4 |
| 999998 | 0 | 0.9 | 0.762112 | NaN | 189.0 | 20 | 0.015352 | 94.661055 | AA | 1418 | 8092.641762 | 3982.582204 | 4394.803296 | 1678 | 6 | CA | 163 | 0 | BA | 1 | 0 | 28.0 | 0 | 500.0 | 0 | INTERNET | 4.336064 | windows | 1 | 1.0 | 0 | 4 |
| 999999 | 0 | 0.2 | 0.697452 | NaN | 321.0 | 20 | 2.655916 | 9.908499 | AA | 951 | 6169.630036 | 3695.308261 | 4352.334543 | 2 | 12 | CA | 36 | 1 | BE | 0 | 1 | 15.0 | 0 | 200.0 | 0 | INTERNET | 6.717022 | linux | 0 | 1.0 | 0 | 4 |
1000000 rows × 32 columns
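The -1 sentinel conversion above can also be written with pandas' own `replace` and `mask` instead of `np.where`; a minimal sketch on a toy frame (column names borrowed from the dataset):

```python
import numpy as np
import pandas as pd

# toy frame reproducing the dataset's negative sentinels for missing values
toy = pd.DataFrame({
    "bank_months_count": [24, -1, 15],
    "intended_balcon_amount": [-1.33, 3.10, 0.02],
})

# replace the exact -1 sentinel with NaN in the count column
toy["bank_months_count"] = toy["bank_months_count"].replace(-1, np.nan)

# any negative intended_balcon_amount also encodes a missing value
toy["intended_balcon_amount"] = toy["intended_balcon_amount"].mask(
    toy["intended_balcon_amount"] < 0)

print(toy["bank_months_count"].isna().sum())       # 1
print(toy["intended_balcon_amount"].isna().sum())  # 1
```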
# checking whether we have any duplicate rows
print(df_fraud.shape, df_fraud.drop_duplicates().shape)
(1000000, 32) (1000000, 32)
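An equivalent, more direct check counts duplicated rows with `DataFrame.duplicated`; a toy sketch:

```python
import pandas as pd

# toy frame where the second and third rows are identical
toy = pd.DataFrame({"a": [1, 2, 2, 3], "b": ["x", "y", "y", "z"]})

# duplicated() marks every repeat of an earlier row
n_dups = toy.duplicated().sum()
print(n_dups)  # 1
```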
df_fraud[["payment_type"]]
| payment_type | |
|---|---|
| 0 | AA |
| 1 | AB |
| 2 | AC |
| 3 | AB |
| 4 | AB |
| ... | ... |
| 999995 | AB |
| 999996 | AC |
| 999997 | AA |
| 999998 | AA |
| 999999 | AA |
1000000 rows × 1 columns
# checking which data type each column holds and how many columns we have of each type
df_fraud.dtypes.sort_values().to_frame("data_type").groupby("data_type").size().to_frame("count").reset_index()
| data_type | count | |
|---|---|---|
| 0 | int64 | 14 |
| 1 | float64 | 13 |
| 2 | object | 5 |
df_fraud.dtypes.sort_values().to_dict()
{'fraud_bool': dtype('int64'),
'keep_alive_session': dtype('int64'),
'foreign_request': dtype('int64'),
'has_other_cards': dtype('int64'),
'phone_mobile_valid': dtype('int64'),
'phone_home_valid': dtype('int64'),
'email_is_free': dtype('int64'),
'credit_risk_score': dtype('int64'),
'device_fraud_count': dtype('int64'),
'date_of_birth_distinct_emails_4w': dtype('int64'),
'bank_branch_count_8w': dtype('int64'),
'month': dtype('int64'),
'zip_count_4w': dtype('int64'),
'customer_age': dtype('int64'),
'velocity_4w': dtype('float64'),
'device_distinct_emails_8w': dtype('float64'),
'income': dtype('float64'),
'session_length_in_minutes': dtype('float64'),
'name_email_similarity': dtype('float64'),
'proposed_credit_limit': dtype('float64'),
'prev_address_months_count': dtype('float64'),
'bank_months_count': dtype('float64'),
'current_address_months_count': dtype('float64'),
'days_since_request': dtype('float64'),
'intended_balcon_amount': dtype('float64'),
'velocity_6h': dtype('float64'),
'velocity_24h': dtype('float64'),
'housing_status': dtype('O'),
'source': dtype('O'),
'device_os': dtype('O'),
'payment_type': dtype('O'),
'employment_status': dtype('O')}
df_fraud_bool = df_fraud["fraud_bool"]\
.value_counts(normalize = True)\
.mul(100).rename("percent")
df_fraud_bool_pc = df_fraud["fraud_bool"].value_counts()
df_fraud_bool_pc = pd.merge(df_fraud_bool, df_fraud_bool_pc, left_index = True, right_index = True, how='inner')
df_fraud_bool_pc
| percent | count | |
|---|---|---|
| fraud_bool | ||
| 0 | 98.8971 | 988971 |
| 1 | 1.1029 | 11029 |
We can see that 98.90% of the instances are not bank accounts that were used to commit fraud.
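Given this imbalance (only 1.10% fraud), later modeling steps should at least split the data while preserving the class ratio; a minimal sketch with scikit-learn's `train_test_split` on a toy target (an assumption: scikit-learn is available, although it is not imported in this notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy imbalanced target: 90 negatives, 10 positives
toy = pd.DataFrame({"feature": range(100),
                    "fraud_bool": [0] * 90 + [1] * 10})

X_train, X_test, y_train, y_test = train_test_split(
    toy[["feature"]], toy["fraud_bool"],
    test_size=0.2, stratify=toy["fraud_bool"], random_state=0)

# both partitions keep the 10% positive rate
print(y_train.mean(), y_test.mean())  # 0.1 0.1
```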
df_fraud_bool = df_fraud["fraud_bool"]\
.value_counts(normalize = True)\
.mul(100).rename("percent").reset_index()
df_fraud_bool_pc = df_fraud["fraud_bool"].value_counts().reset_index()
df_fraud_bool_pc = pd.merge(df_fraud_bool, df_fraud_bool_pc, on = "fraud_bool", how = 'inner')\
.rename(columns = {"fraud_bool": "index"})
df_fraud_bool_pc
| index | percent | count | |
|---|---|---|---|
| 0 | 0 | 98.8971 | 988971 |
| 1 | 1 | 1.1029 | 11029 |
fig = px.histogram(df_fraud_bool_pc, x = "index", y = ['percent'])
fig.update_xaxes(tickvals = [0, 1])
fig.show()
The chart above visualises the amount of fraud present in the dataset: a very small fraction of the total.
df_series_null_rows = df_fraud.isnull().sum(axis = 1).sort_values(ascending = False)
df_series_null_columns = df_fraud.isnull().sum().sort_values(ascending = False) # axis = 0 is implied
print(df_series_null_rows.shape, df_series_null_columns.shape)
(1000000,) (32,)
df_null_rows = pd.DataFrame(df_series_null_rows, columns = ["null_rows"])
df_null_columns = pd.DataFrame(df_series_null_columns, columns = ["null_columns"])
df_null_rows["target"] = df_fraud["fraud_bool"].copy()
df_null_rows["row_percentage"] = df_null_rows["null_rows"] / df_fraud.shape[1]
df_null_columns["column_percentage"] = df_null_columns["null_columns"] / df_fraud.shape[0]
df_null_columns.head(6)
| null_columns | column_percentage | |
|---|---|---|
| intended_balcon_amount | 742523 | 0.742523 |
| prev_address_months_count | 712920 | 0.712920 |
| bank_months_count | 253635 | 0.253635 |
| current_address_months_count | 4254 | 0.004254 |
| session_length_in_minutes | 2015 | 0.002015 |
| device_distinct_emails_8w | 359 | 0.000359 |
The table above shows the number of null values in each column and the percentage of the column they account for.
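The per-column computation above can be wrapped in a small reusable helper; a sketch on a toy frame (the helper name is ours, not part of the notebook):

```python
import numpy as np
import pandas as pd

def null_summary(df):
    """Per-column null count and fraction, sorted descending."""
    nulls = df.isnull().sum().sort_values(ascending=False)
    return pd.DataFrame({"null_columns": nulls,
                         "column_percentage": nulls / df.shape[0]})

toy = pd.DataFrame({"a": [1, np.nan, np.nan, 4], "b": [1, 2, 3, np.nan]})
summary = null_summary(toy)
print(summary)
#    null_columns  column_percentage
# a             2               0.50
# b             1               0.25
```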
threshold = 0.9
list_vars_not_null = list(df_null_columns[df_null_columns["column_percentage"] < threshold].index)
df_fraud_filter_null = df_fraud.loc[:, list_vars_not_null]
df_fraud_filter_null.shape
(1000000, 32)
df_fraud.shape
(1000000, 32)
categorical_vars_list, other = f_aux.dame_variables_categoricas(dataset = df_fraud_filter_null)
df_fraud_filter_null[categorical_vars_list] = df_fraud_filter_null[categorical_vars_list].astype("category")
df_fraud_filter_null[categorical_vars_list].head()
| fraud_bool | device_os | source | housing_status | employment_status | payment_type | |
|---|---|---|---|---|---|---|
| 0 | 1 | windows | INTERNET | BA | CA | AA |
| 1 | 1 | windows | INTERNET | BA | CA | AB |
| 2 | 1 | other | INTERNET | BA | CB | AC |
| 3 | 1 | linux | INTERNET | BA | CA | AB |
| 4 | 1 | macintosh | INTERNET | BA | CA | AB |
The table above shows the first five rows of the categorical columns of the dataset, while the table below shows the first five rows of the quantitative columns.
df_fraud_filter_null[other].head()
| foreign_request | phone_mobile_valid | has_other_cards | keep_alive_session | device_fraud_count | phone_home_valid | credit_risk_score | email_is_free | date_of_birth_distinct_emails_4w | bank_branch_count_8w | zip_count_4w | customer_age | month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 185 | 0 | 6 | 1 | 769 | 50 | 7 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 259 | 1 | 3 | 718 | 366 | 50 | 7 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 177 | 1 | 14 | 1 | 870 | 40 | 7 |
| 3 | 0 | 1 | 1 | 0 | 0 | 0 | 110 | 1 | 6 | 1921 | 810 | 50 | 7 |
| 4 | 0 | 0 | 0 | 1 | 0 | 1 | 295 | 1 | 2 | 1990 | 890 | 50 | 7 |
categorical_vars_list
['fraud_bool', 'device_os', 'source', 'housing_status', 'employment_status', 'payment_type']
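`dame_variables_categoricas` comes from the external funciones_auxiliares module, whose source is not shown here; the sketch below is a hypothetical reconstruction (the real helper may use a different rule) that treats object-dtype columns and the binary target as categorical:

```python
import pandas as pd

def dame_variables_categoricas(dataset, target="fraud_bool", max_levels=10):
    """Split columns into categorical candidates and the rest.

    Hypothetical reconstruction: object-dtype columns with few distinct
    levels, plus the binary target, are treated as categorical; every
    other column goes into the 'other' list.
    """
    categorical, other = [], []
    for col in dataset.columns:
        if col == target or (dataset[col].dtype == "object"
                             and dataset[col].nunique() < max_levels):
            categorical.append(col)
        else:
            other.append(col)
    return categorical, other

toy = pd.DataFrame({"fraud_bool": [0, 1], "device_os": ["linux", "windows"],
                    "customer_age": [30, 40]})
cats, other = dame_variables_categoricas(toy)
print(cats)   # ['fraud_bool', 'device_os']
print(other)  # ['customer_age']
```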
Next, analysing a few columns of the dataset, we look at what percentage of each column is made up of each unique value that appears.
df_fraud_filter_null["fraud_bool"].value_counts()
fraud_bool
0    988971
1     11029
Name: count, dtype: int64
df_fraud_filter_null["income"].value_counts()
income
0.9    221419
0.1    157449
0.8    146650
0.6    111973
0.7    105109
0.4     81364
0.2     69345
0.5     55858
0.3     50833
Name: count, dtype: int64
df_fraud_filter_null["customer_age"].value_counts()
customer_age
30    311433
20    245855
40    238712
50    140353
60     34770
10     20987
70      6517
80      1297
90        76
Name: count, dtype: int64
df_fraud_filter_null[categorical_vars_list].dtypes
fraud_bool           category
device_os            category
source               category
housing_status       category
employment_status    category
payment_type         category
dtype: object
print(df_fraud_filter_null["payment_type"].value_counts().count())
df_fraud_filter_null["payment_type"]\
.apply(lambda x: str(x).lower().strip()).value_counts(normalize=True)#.count()
5
payment_type
ab    0.370554
aa    0.258249
ac    0.252071
ad    0.118837
ae    0.000289
Name: proportion, dtype: float64
print(df_fraud_filter_null["customer_age"].value_counts().count())
df_fraud_filter_null["customer_age"]\
.apply(lambda x: str(x).lower().strip()).value_counts(normalize=True)
9
customer_age
30    0.311433
20    0.245855
40    0.238712
50    0.140353
60    0.034770
10    0.020987
70    0.006517
80    0.001297
90    0.000076
Name: proportion, dtype: float64
df_fraud_filter_null["payment_type"]
0 AA
1 AB
2 AC
3 AB
4 AB
..
999995 AB
999996 AC
999997 AA
999998 AA
999999 AA
Name: payment_type, Length: 1000000, dtype: category
Categories (5, object): ['AA', 'AB', 'AC', 'AD', 'AE']
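Category columns like this one (except the target) will eventually need a numeric encoding for the classifier; a minimal sketch with `pd.get_dummies` on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"payment_type": pd.Categorical(["AA", "AB", "AA"]),
                    "income": [0.9, 0.1, 0.8]})

# one-hot encode the categorical column; drop_first avoids a redundant level
encoded = pd.get_dummies(toy, columns=["payment_type"], drop_first=True)
print(list(encoded.columns))  # ['income', 'payment_type_AB']
```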
# cleaning floating-point artifacts in income (0.6000000000000001 -> 0.6, etc.);
# using replace instead of chained assignment avoids a SettingWithCopyWarning
df_fraud_filter_null["income"] = df_fraud_filter_null["income"].replace(
    {0.6000000000000001: 0.6, 0.7000000000000001: 0.7})
for i in list(df_fraud_filter_null.columns):
    if df_fraud_filter_null[i].dtype == "int64":
        df_fraud_filter_null[i] = df_fraud_filter_null[i].astype("float64")
df_fraud_filter_null.to_csv("../data/df_data_initial_preprocessing.csv")
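Note that the CSV round trip loses the `category` dtypes and writes the index as an unnamed first column; a small sketch of restoring both when reloading (using an in-memory buffer instead of the real file):

```python
import io
import pandas as pd

# round-trip sketch: write with the index, read it back with index_col=0
toy = pd.DataFrame({"payment_type": pd.Categorical(["AA", "AB"]),
                    "income": [0.9, 0.1]})
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)

reloaded = pd.read_csv(buf, index_col=0)
# dtype information is lost in the CSV round trip...
assert reloaded["payment_type"].dtype == "object"
# ...and must be restored explicitly
reloaded["payment_type"] = reloaded["payment_type"].astype("category")
print(reloaded["payment_type"].dtype)  # category
```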